Introduction: Why Code for Data Science?

Phil Chodrow

Tuesday, August 27th, 2019

What is Data Science?

Some Things That Aren’t Data Science

The Cloud™

Deep Learning

Color names generated by a neural network (credit: Janelle Shane)

BIG DATA!!1!!

Data Science Is:

  • Gathering data that matters.
  • Asking questions that matter about your data.
  • Choosing appropriate methods to answer those questions.
  • Implementing solutions that meet stakeholder needs.

Data Science Tools

You Can Do Data Science With:

  • A pencil and paper
  • A calculator
  • Excel
  • R, Julia, Python, …

Why Not Excel?

Why Code?

Why R for Data Analysis?

  • R is the best language in the world for learning data science.
  • R is one of the best languages in the world for doing data science.
  • R tends to be preferred in academia and among “statisticians,” while Python is more popular among “computer scientists” and “data scientists.”
  • Most practicing data scientists know and use both.

Why Julia and JuMP for Optimization?

  • Julia is a high-performance, open-source dynamic language for technical computing – easy to write, fast to run.
  • Developed at MIT
  • JuMP is a package for optimization in Julia – developed by ORC students!

…yes, there will be an opportunity to learn Python later in the semester.

Learning Goals

What can you pick up in two days?

  • You are not going to become an R or Julia expert in two days.
  • But…
  • You will know the basic concepts and vocabulary of data science – enough to employ the most important skill of all.

The most important skill of all…

Gameplan

  1. Today: Version Control, Basic Data Analysis and Visualization in R
  2. Tomorrow: Optimization in Julia and JuMP, Presenting Work
  3. Both days: mini-project, partner work, lots of exercises.

Exercise 0

  1. Look left.
  2. Look right.
  3. Pick a partner (groups of 3 are fine).
  4. Give them a professional, yet friendly smile.
  5. You are going to need them soon.